- Path: ix.netcom.com!news
- From: jlilley@ix.netcom.com (John Lilley)
- Newsgroups: comp.lang.c++
- Subject: Re: Special sort algorithm
- Date: 29 Mar 1996 16:05:48 GMT
- Organization: Netcom
- Message-ID: <4jh1os$scq@dfw-ixnews6.ix.netcom.com>
- References: <31593E09.29A0@kmd.dk>
- NNTP-Posting-Host: den-co11-03.ix.netcom.com
- Mime-Version: 1.0
- Content-Type: Text/Plain; charset=US-ASCII
- X-NETCOM-Date: Fri Mar 29 10:05:48 AM CST 1996
- X-Newsreader: WinVN 0.99.7
-
- >It must :
- > - Sort big files of short lines
- > lines are uneven length
- > files are XX MB
- > lines are < 100 chars
- > - Put resulting file in original file
- > - Run in very little memory ( < 50 K)
- > - Use about no extra disk space ( < 10% extra)
- > - Run reasonably fast
- >
- >I have tried implementing in-file bubblesort and others with the language
- >being C++. But they are way too slow, even with some optimizing buffers.
-
- Have you tried in-place QuickSort? It would only be NlogN instead
- of N*N. But assuming you've tried that... This will work with about
- 20% extra disk space, and should be faster than the in-place quicksort.
-
- Since you've got the disk space limitation, I would try the following:
-
- 1) Create an index file for the file you want to sort (e.g., a file
-    containing binary 4-byte integers, one per line, each set to the
-    offset within the source file of the beginning of that line).
-
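- The index-building pass can be sketched like so. This is an
- illustrative in-memory version (the real thing would write the
- 4-byte offsets out as the index file), and it assumes every line,
- including the last, ends in '\n':

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Sketch of step 1: record the byte offset of the start of each line.
// In the real scheme these 4-byte integers would be written to an
// index file; here they are collected in a vector for illustration.
// Assumes every line, including the last, is terminated by '\n'.
std::vector<std::uint32_t> build_line_index(std::istream &src) {
    std::vector<std::uint32_t> index;
    std::string line;
    std::uint32_t offset = 0;
    while (std::getline(src, line)) {
        index.push_back(offset);
        offset += static_cast<std::uint32_t>(line.size()) + 1; // +1 for '\n'
    }
    return index;
}
```
-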
- 2) Sort the index file (i.e., treat the indices as pointers into the
-    source file and sort the indices such that the corresponding lines
-    are sorted), probably using a disk-based mergesort. For
-    files with an average line length > 40 characters, the index file
- will be 10% of the source file size. The intermediate files for
- the mergesort will take an additional 10%. It will still be
- slow because you will be performing NlogN random reads in the
- source on the second and subsequent passes of the mergesort,
- but that is hopefully better than NlogN reads AND writes.
- The mergesort itself requires NlogN *sequential* reads/writes
- of the index files, which is far better than random accesses.
-
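- The ordering rule for step 2 looks like this. The real version is
- a disk-based mergesort over the index file; std::stable_sort over an
- in-memory vector stands in here purely to show the comparison --
- indices are ordered by the lines they point at, not by their values:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read the line that starts at byte `offset` in the source stream.
std::string line_at(std::istream &src, std::uint32_t offset) {
    src.clear();                 // clear any EOF state from a prior read
    src.seekg(offset);
    std::string line;
    std::getline(src, line);
    return line;
}

// Sketch of step 2's ordering rule: sort the offsets so that the lines
// they point at come out in sorted order. A disk-based mergesort would
// apply the same comparison while streaming runs of the index file.
void sort_index(std::vector<std::uint32_t> &index, std::istream &src) {
    std::stable_sort(index.begin(), index.end(),
                     [&src](std::uint32_t a, std::uint32_t b) {
                         return line_at(src, a) < line_at(src, b);
                     });
}
```
-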
- 3) Once the index file is sorted, create the sorted source file
- based on the index file. This will involve N random
- reads/writes in the source file, and another 10N sequential
- reads/writes (which is hopefully better than the NlogN random
- reads/writes needed by an in-place quicksort).
- You will also need a disk buffer, but since you already
-    allocated one for the mergesort intermediate files, you
-    might as well reuse that space. Walk through the
-    sorted index file, and
- for each line indexed by those integers, find that line in
- the source file, and write it over the beginning of the source
- file. As you do so, to avoid destroying lines that would be
- overwritten, write them to the disk buffer and keep track of
- M, where M is the number of lines that have been moved to
- the disk buffer. When the disk buffer reaches its 10% limit,
- compact the source file by removing all lines that have been
- moved to the beginning, adjust the indices, and then append
- the lines from the disk buffer that have not been written to
- the beginning of the source file, and adjust *those* indices
- as well. Keep going until all indices have been processed.
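-
- The core of step 3, stripped down, is just "walk the sorted index
- and emit each referenced line." The sketch below writes to a separate
- destination stream for clarity; the in-place scheme described above,
- with its 10% spill buffer and compaction, is considerably more
- involved but applies the same permutation:

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Simplified sketch of step 3: walk the sorted index and emit each
// referenced line in order. This writes to a separate destination;
// the actual scheme overwrites the source file in place, spilling
// not-yet-moved lines to a bounded disk buffer.
void emit_sorted(std::istream &src, const std::vector<std::uint32_t> &index,
                 std::ostream &dst) {
    std::string line;
    for (std::uint32_t offset : index) {
        src.clear();             // clear any EOF state from a prior read
        src.seekg(offset);
        std::getline(src, line);
        dst << line << '\n';
    }
}
```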
-
- Disclaimer: This is ad-hoc. There is almost certainly a better
- way to do this described somewhere in the literature.
-
- john lilley
-